A METHOD AND DEVICE FOR HIGH QUALITY CODING
OF WIDEBAND SPEECH AND AUDIO SIGNALS

BACKGROUND OF THE INVENTION 

1. Field of the invention: 

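The present invention relates to an efficient technique
for digitally encoding a wideband signal, in particular but not
exclusively a speech signal, in view of transmitting, or storing, and
synthesizing this wideband sound signal.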



2. Brief description of the prior art:

The demand for efficient digital wideband speech/audio 
encoding techniques with a good subjective quality/bit rate trade-off is 
increasing for numerous applications such as audio/video 
teleconferencing, multimedia, and wireless applications, as well as 
Internet and packet network applications. Until recently, telephone 
bandwidths filtered in the range 200-3400 Hz were mainly used in speech 
coding applications. However, there is an increasing demand for 
wideband speech applications in order to increase the intelligibility and
naturalness of the speech signals. A bandwidth in the range 50-7000 Hz
was found sufficient for delivering a face-to-face speech quality. For
audio signals, this range gives an acceptable audio quality, but still lower
than the CD quality which operates on the range 20-20000 Hz. 

A speech encoder converts a speech signal into a digital
bitstream which is transmitted over a communication channel (or stored
in a storage medium). The speech signal is digitized (sampled and
quantized, usually with 16 bits per sample) and the speech encoder has
the role of representing these digital samples with a smaller number of
bits while maintaining a good subjective speech quality. The speech
decoder or synthesizer operates on the transmitted or stored bit stream
and converts it back to a sound signal.



One of the best prior art techniques capable of achieving
a good quality/bit rate trade-off is the so-called Code Excited Linear
Prediction (CELP) technique. According to this technique, the sampled
speech signal is processed in successive blocks of L samples usually
called frames, where L is some predetermined number (corresponding to
10-30 ms of speech). In CELP, a linear prediction (LP) filter is computed
and transmitted every frame. The L-sample frame is then divided into
smaller blocks called subframes of size N samples, where L=kN and k is
the number of subframes in a frame (N usually corresponds to 4-10 ms
of speech). An excitation signal is determined in each subframe, which
usually consists of two components: one from the past excitation (also
called pitch contribution or adaptive codebook) and the other from an
innovation codebook (also called fixed codebook). This excitation signal
is transmitted and used at the decoder as the input of the LP synthesis
filter in order to obtain the synthesized speech.
An innovation codebook, in the CELP context, is an
indexed set of N-sample-long sequences which will be referred to as
N-dimensional codevectors. Each codebook sequence is indexed by an
integer k ranging from 1 to M, where M represents the size of the
codebook, often expressed as a number of bits b, where $M = 2^b$.



To synthesize speech according to the CELP technique,
each block of N samples is synthesized by filtering an appropriate
codevector from a codebook through time varying filters modeling the
spectral characteristics of the speech signal. At the encoder end, the
synthetic output is computed for all, or a subset, of the codevectors from the
codebook (codebook search). The retained codevector is the one producing
the synthetic output closest to the original speech signal according to a
perceptually weighted distortion measure. This perceptual weighting is
performed using a so-called perceptual weighting filter, which is usually
derived from the LP filter.



The CELP model has been very successful in encoding telephone
band sound signals, and several CELP-based standards exist in a wide
range of applications, especially in digital cellular applications. In the
telephone band, the sound signal is band-limited to 200-3400 Hz and
sampled at 8000 samples/sec. In wideband speech/audio applications, the
sound signal is band-limited to 50-7000 Hz and sampled at 16000
samples/sec.



Some difficulties arise when applying the telephone-band
optimized CELP model to wideband signals, and additional features need
to be added to the model in order to obtain high quality wideband signals.
Wideband signals exhibit a much wider dynamic range compared to
telephone-band signals, which results in precision problems when a
fixed-point implementation of the algorithm is required (which is essential in
wireless applications). Further, the CELP model will often spend most of its
encoding bits on the low-frequency region, which usually has higher energy
contents, resulting in a low-pass output signal. To overcome this problem,
the perceptual weighting filter has to be modified in order to suit wideband
signals, and pre-emphasis techniques which boost the high frequency
regions become important to reduce the dynamic range, yielding a simpler
fixed-point implementation, and to ensure a better encoding of the higher
frequency contents of the signal. Further, the pitch contents in the spectrum
of voiced segments in wideband signals do not extend over the whole
spectrum range, and the amount of voicing shows more variation compared
to narrow-band signals. Thus, it is important to improve the closed-loop pitch
analysis to better accommodate the variations in the voicing level.



At the decoder side, the CELP model uses post-filtering and
post-processing techniques in order to improve the perceived synthesized signal.
These techniques have to be changed to accommodate wideband signals.
Further, in order to lower the bit rate below 16 kbit/s, an efficient method is
to down-sample the wideband signals, which enables the encoder to operate
on a bandwidth lower than 7000 Hz, thus achieving a reduction in the bit
rate. At the decoder side, the decoded signal is upsampled and an efficient
high frequency generation technique is needed to recover the full band
signal, while maintaining a quality close to the original signal.



OBJECTS OF THE INVENTION




An object of the present invention is therefore to provide a method
and device for efficiently encoding wideband (7000 Hz) sound signals using
CELP-type encoding techniques, using additional features at both encoder
and decoder in order to obtain a high quality reconstructed sound signal,
which is also suitable for fixed point algorithmic implementation.




SUMMARY OF THE INVENTION



More specifically, in accordance with the present invention, there
is provided a method for encoding wideband sound signals using LP-based,
preferably CELP-type, encoding techniques, in which the following novel
features are adopted in order to obtain high subjective quality of the decoded
wideband sound signal:

1. The overall perceptual weighting of the quantization error is
obtained by a combination of a preemphasis filter and a modified weighting
filter.

In CELP-type coders, the optimum pitch and innovation
parameters are searched by minimizing the mean squared error between the
input speech and synthesized speech in a perceptually weighted domain.
This is equivalent to minimizing the error between the weighted input speech
and weighted synthesis speech, where the weighting is performed using a
filter having a transfer function W(z) of the form:

$$W(z) = \frac{A(z/\gamma_1)}{A(z/\gamma_2)}, \quad \text{where } 0 < \gamma_2 < \gamma_1 < 1$$

In analysis-by-synthesis (AbS) coders, analysis shows that the quantization
error is weighted by the inverse of the weighting filter, $W^{-1}(z)$, which
exhibits some of the formant structure in the input signal. Thus, the
masking property of the human ear is exploited by shaping the error, so
that it has more energy in the formant regions, where it will be masked by
the strong signal energy present in those regions. The amount of
weighting is controlled by the factors $\gamma_1$ and $\gamma_2$.

This filter works well with telephone band signals. However, it was
found that this filter is not suitable for efficient perceptual weighting when
applied to wideband signals. It was found that this filter has
inherent limitations in modeling the formant structure and the required
spectral tilt concurrently. The spectral tilt is more pronounced in
wideband signals due to the wide dynamic range between low and high
frequencies. It was suggested to add a tilt filter into filter W(z) in order to
control the tilt and formant weighting separately.

A novel solution to this problem, forming part of the present
invention, is to introduce a preemphasis filter at the input, compute the LP
filter A(z) based on the preemphasized speech, and use a modified filter
W(z) by fixing its denominator.






The preemphasis filter reduces the dynamic range of the input
signal, which renders it more suitable for fixed-point implementation, and
improves the encoding of the high frequency contents of the spectrum. The
preemphasis is obtained by a fixed FIR filter having a transfer function P(z)
in the form:

$$P(z) = 1 - \mu z^{-1}$$

where $\mu$ is a preemphasis factor with a value between 0 and 1. A higher
order filter can also be used. Linear prediction (LP) analysis is performed
on the preemphasized input signal to obtain the LP filter A(z). A new
weighting filter is used, which has a transfer function of the form:

$$W(z) = \frac{A(z/\gamma_1)}{1 - \gamma_2 z^{-1}}, \quad \text{where } 0 < \gamma_2 < \gamma_1 < 1$$

Note that because A(z) is computed based on preemphasized speech,
the tilt of the filter $1/A(z/\gamma_1)$ is less pronounced compared to the case
when A(z) is computed based on the original speech. Since deemphasis is
performed at the decoder end using the filter $P^{-1}(z) = 1/(1 - \mu z^{-1})$, the
quantization error spectrum is shaped by the filter $W^{-1}(z)P^{-1}(z)$. When
$\gamma_2$ is set equal to $\mu$, which is typically the case, the spectrum of the
quantization error is shaped by the filter $1/A(z/\gamma_1)$, with A(z) computed
based on the preemphasized speech. Subjective listening showed that
this structure of achieving the error shaping by a combination of
preemphasis and modified weighting filtering is very efficient for encoding
wideband signals, in addition to the advantages of ease of fixed-point
algorithmic implementation.
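The cancellation behind this statement can be written out in one line. The following is a worked check rather than part of the original text; it uses only the preemphasis filter $P(z) = 1 - \mu z^{-1}$ and the modified weighting filter $W(z) = A(z/\gamma_1)/(1 - \gamma_2 z^{-1})$:

```latex
W^{-1}(z)\,P^{-1}(z)
  = \frac{1 - \gamma_2 z^{-1}}{A(z/\gamma_1)} \cdot \frac{1}{1 - \mu z^{-1}}
  = \frac{1}{A(z/\gamma_1)}
  \qquad \text{when } \gamma_2 = \mu .
```

With $\gamma_2 = \mu$ the fixed denominator of W(z) cancels the deemphasis pole, so only the formant-shaping term $1/A(z/\gamma_1)$ remains.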
2. The closed-loop pitch analysis is improved to better
accommodate wideband signals.

The pitch harmonics in AbS coders are usually modeled using
a pitch delay T and an associated gain b. The excitation signal u(n) is
derived by adding the past excitation at delay T scaled by a gain b to an
innovation component from a fixed codebook scaled by a gain g. That is

$$u(n) = b\,v_T(n) + g\,c_k(n)$$

where $v_T(n)$ is the past excitation at delay T samples. For an improved
performance, a fractional delay is usually used. In this case, the past
excitation is oversampled to achieve the required higher resolution. In
most cases, the pitch predictor can be represented by a filter having a
transfer function of the form $1/(1 - b z^{-T})$, whose spectrum has a harmonic
structure over the entire frequency range, with a harmonic frequency
related to 1/T. In case of wideband signals, this structure is not very
efficient since the harmonic frequencies do not cover the entire extended
spectrum. The harmonic structure exists only up to a certain frequency,
depending on the speech segment. A new method which achieves
efficient modeling of the harmonic structure of the speech spectrum uses
several forms of low pass filters applied to the past excitation, and the one
yielding the highest prediction gain is selected. When subsample pitch
resolution is used, the low pass filters can be incorporated into the
interpolation filters used to obtain the higher pitch resolution.
3. At the decoder, the innovative contribution to the excitation is
enhanced by filtering it through a preemphasis filter whose coefficients
are derived from the level of voicing in the speech segment in the subframe.

Enhancing the periodicity of the excitation signal improves the
quality in case of voiced segments. This was done in the past by filtering
the innovation from the fixed codebook through a filter having a transfer
function of the form $1/(1 - \varepsilon b z^{-T})$, where $\varepsilon$ is a factor below 0.5 which
controls the amount of introduced periodicity. This approach is less
efficient in case of wideband signals since it introduces the periodicity
over the entire spectrum. A new alternative approach is disclosed
whereby the periodicity enhancement is achieved by filtering the
innovative signal from the fixed codebook by a filter which emphasizes
the high frequencies and reduces the low-frequency contents of the
innovation, and whose coefficients are related to the level of periodicity
in the signal. In this approach, the innovative contribution is reduced
mainly at low frequencies, which enhances the periodicity of the excitation
at low frequencies more than at high frequencies.

4. A new high-frequency generation procedure is introduced in 

order to recover the high frequency content of the signal, in case the input 
signal has been down-sampled. 

In order to improve the coding efficiency and reduce the
algorithmic complexity of the wideband coding algorithm, the input
wideband signal is down-sampled from 16 kHz to around 12.8 kHz. This
reduces the number of samples in a frame and the processing
time, and reduces the signal bandwidth below 7000 Hz, which enables a
reduction of the bit rate down to 12 kbit/s while keeping a very high quality
decoded sound signal. At the decoder, the high frequency contents of the
signal need to be reintroduced to remove the low pass filtering effect from the
decoded signal and retrieve the natural sounding quality of wideband
signals. A new approach consists of generating the high frequency
contents by filling the upper part of the spectrum with a white noise
properly scaled in the excitation domain, then converted to the speech
domain, preferably but not necessarily by shaping it with the same LP
filter used for synthesizing the down-sampled signal.

The objects, advantages and other features of the present
invention will become more apparent upon reading of the following
non-restrictive description of a preferred embodiment thereof, given by way of
example only with reference to the accompanying drawings.






BRIEF DESCRIPTION OF THE DRAWINGS 



In the appended drawings: 

Figure 1 is a schematic block diagram of a preferred 
embodiment of a wideband encoding device embodying the present 
invention; 



Figure 2 is a schematic block diagram of a preferred

embodiment of a wideband decoding device embodying the present 
invention, and comprising a method for high frequency generation; and 



CA 02252170 1998-10-27



Figure 3 is a schematic block diagram of a closed-loop pitch
analysis device suitable for wideband signals.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

The novel techniques disclosed in the present specification may 
apply to different LP (Linear Prediction)-based coding systems. However, 
a CELP-type coding system is used in the preferred embodiment for 
presenting a non limitative illustration of the techniques disclosed herein. 

Figure 1 shows a general block diagram of a CELP-type
speech encoding device modified to better accommodate wideband
signals.



The sampled input speech signal is divided into successive L-sample
blocks called "frames". In each frame, different parameters representing the
speech signal in the frame are computed, encoded, and transmitted. LP
parameters representing the LP synthesis filter are usually computed
once every frame. The frame is further divided into smaller blocks of
length N, in which excitation parameters (pitch and innovation) are
determined. In the CELP literature, these blocks of length N are called
"subframes" and the N-sample signals in a subframe are referred to as
N-dimensional vectors. In this preferred embodiment, the length N
corresponds to 5 ms while the length L corresponds to 20 ms, which
means that a frame contains four subframes (N=80 at the sampling rate
of 16 kHz and 64 after down-sampling to 12.8 kHz). Various
N-dimensional vectors occur in the encoding procedure. A list of the vectors
which appear in Figures 1 and 2 as well as a list of transmitted
parameters are given herein below:

List of the main N-dimensional vectors

s      Input speech vector (after down-sampling, pre-processing,
       and preemphasis);
s_w    Weighted speech vector;
s_0    Zero-input response of weighted synthesis filter;
x      Target vector for pitch search;
h      Impulse response of the combination of synthesis and
       weighting filters;
v_T    Adaptive codebook vector at delay T;
y_T    Filtered adaptive codebook vector (v_T convolved with h);
x'     Target vector for innovation search;
c_k    Innovation codevector at index k (k-th entry from the
       innovation codebook);
c_f    Enhanced scaled innovation codevector;
u      Excitation signal (scaled innovation and pitch codevectors);
u'     Enhanced excitation;
s'     Synthesis signal before deemphasis; and
s_h    Synthesis signal after deemphasis and postprocessing.



List of transmitted parameters

STP    Short term prediction parameters (defining A(z));
T      Pitch lag (or adaptive codebook index);
b      Pitch gain (or adaptive codebook gain);
j      Index of the low-pass filter used on the pitch
       codevector;
k      Codevector index (innovation codebook entry); and
g      Innovation codebook gain.

In this preferred embodiment, the STP parameters are transmitted once 
per frame and the rest of the parameters are transmitted four times per 
frame (every subframe). 

ENCODING PRINCIPLE 

The sampled speech signal is encoded on a block by block
basis by the encoding device of Figure 1, which is broken down into
eleven modules numbered from 101 to 111.

The input speech is processed into the above-mentioned L-sample
blocks called frames.

Referring to Figure 1, the sampled input speech signal is down-sampled
in a down-sampling module 101. In this preferred embodiment, the signal
is down-sampled from 16 kHz down to 12.8 kHz, using techniques well
known in the art. Down-sampling increases the coding efficiency, since
a smaller bandwidth is encoded. This also reduces the algorithmic
complexity since the number of samples in a frame is decreased. The
use of down-sampling becomes significant as the bit rate is reduced
below 16 kbit/s, although down-sampling is not essential above 16 kbit/s.

After down-sampling, the 320-sample frame of 20 ms is
reduced to a 256-sample frame.
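The frame and subframe sizes quoted in this description follow directly from the durations and sampling rates. The short Python sketch below is only an illustrative check of that arithmetic; the helper name `samples` is hypothetical and not part of the specification.

```python
from fractions import Fraction

def samples(duration_ms, rate_hz):
    """Number of samples in duration_ms milliseconds at rate_hz (exact arithmetic)."""
    n = Fraction(duration_ms, 1000) * rate_hz
    if n.denominator != 1:
        raise ValueError("duration is not a whole number of samples")
    return int(n)

FRAME_MS, SUBFRAME_MS = 20, 5          # frame and subframe durations from the text

for rate in (16000, 12800):            # before and after down-sampling
    L = samples(FRAME_MS, rate)        # frame length in samples
    N = samples(SUBFRAME_MS, rate)     # subframe length in samples
    print(rate, L, N, L // N)          # L // N is the number of subframes per frame
```

At 12.8 kHz this gives L=256 and N=64, so the frame still contains four subframes after down-sampling.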




The input frame is then passed into the optional pre-processing
block 102, which consists of a high pass filter with a 50 Hz cut-off
frequency. High-pass filter 102 removes the unwanted sound
components below 50 Hz.

The down-sampled pre-processed signal is denoted by $s_p(n)$,
n=0, ..., L-1, where L is the length of the frame (256 at 12.8 kHz sampling).
In preemphasis 103, the signal $s_p(n)$ is preemphasized using a filter
having the following transfer function:

$$P(z) = 1 - \mu z^{-1}$$

where $\mu$ is a preemphasis factor with a value between 0 and 1 (a typical
value is $\mu = 0.7$). A higher order filter can also be used.



Note that the high-pass filter 102 and preemphasis filter 103 can
be interchanged to obtain more efficient fixed-point implementations.

The function of the preemphasis filter 103 is to reduce the dynamic
range of the input speech signal, which renders it more suitable for
fixed-point implementation. Without preemphasis, it is difficult to implement LP
analysis in fixed-point using single-precision arithmetic.

Preemphasis also plays an important role in achieving a proper
overall perceptual weighting of the quantization error, which contributes
to an improved sound quality. This will be explained later in more detail.
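As an illustration of the first-order preemphasis $P(z) = 1 - \mu z^{-1}$ and of the matching deemphasis $1/P(z)$ applied at the decoder, the following is a minimal floating-point sketch. The function names and the `prev` state argument are assumptions made for this example; an actual coder would use fixed-point arithmetic and carry the filter state from frame to frame.

```python
def preemphasis(x, mu=0.7, prev=0.0):
    """Apply P(z) = 1 - mu*z^-1.

    prev is the last input sample of the previous frame (filter memory).
    """
    y = []
    for sample in x:
        y.append(sample - mu * prev)
        prev = sample
    return y

def deemphasis(y, mu=0.7, prev=0.0):
    """Apply 1/P(z) = 1/(1 - mu*z^-1).

    prev is the last output sample of the previous frame (filter memory).
    """
    x = []
    for sample in y:
        prev = sample + mu * prev
        x.append(prev)
    return x

# A slowly varying (low-frequency) input is attenuated,
# while a rapidly alternating (high-frequency) input is boosted.
low = preemphasis([1.0, 1.0, 1.0, 1.0])
high = preemphasis([1.0, -1.0, 1.0, -1.0])
```

After the first sample, the constant input settles near 0.3 in magnitude while the alternating input settles near 1.7, which is the low-frequency attenuation and high-frequency boost that reduces the dynamic range of typical speech spectra.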
The output of the preemphasis filter 103 is denoted s(n). This
signal is used for performing LP analysis, a technique well known in the
art. The autocorrelation approach is used, where the signal is first
windowed using a Hamming window (usually in the order of 30-40 ms).
The autocorrelations are computed from the windowed signal, and
Levinson-Durbin recursion is used to compute the LP parameters, $a_i$,
where i=1, ..., p, and where p is the LP order, which is typically 16 in
wideband coding. The parameters $a_i$ are the coefficients of the transfer
function of the LP filter

$$A(z) = 1 + \sum_{i=1}^{p} a_i z^{-i}$$

LP analysis is performed in module 104, which also performs the
quantization and interpolation of the LP filter coefficients. The LP filter
coefficients are first transformed into another equivalent domain more
suitable for quantization and interpolation purposes. The line spectral pair
(LSP) and immittance spectral pair (ISP) domains are two domains in which
quantization and interpolation can be efficiently performed. The 16 LP
parameters can be quantized in the order of 30 to 50 bits using split or
multi-stage quantization, or a combination thereof. The purpose of the
interpolation is to enable updating the LP parameters every subframe while
transmitting them once every frame, which improves the coder performance
without increasing the bit rate.
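The autocorrelation approach with Levinson-Durbin recursion can be sketched as follows. This is a textbook floating-point version given for illustration only, not the fixed-point procedure of an actual coder; the function names are hypothetical, and the sign convention matches $A(z) = 1 + \sum_{i=1}^{p} a_i z^{-i}$.

```python
import math

def hamming(n_samples):
    """Hamming window of n_samples points (n_samples >= 2)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]

def autocorrelation(x, order):
    """r[k] = sum_n x(n) x(n-k) for k = 0..order."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(order + 1)]

def levinson_durbin(r, order):
    """Solve the normal equations for A(z) = 1 + sum_i a_i z^-i.

    Returns ([1, a_1, ..., a_p], final prediction error energy).
    Assumes r[0] > 0.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err

def lp_analysis(frame, order=16):
    """Window the frame, compute autocorrelations, and run the recursion."""
    w = hamming(len(frame))
    windowed = [s * c for s, c in zip(frame, w)]
    return levinson_durbin(autocorrelation(windowed, order), order)
```

For example, the autocorrelation sequence [1, 0.5, 0.25] of a first-order process yields $a_1 = -0.5$ and $a_2 = 0$, as expected for a first-order predictor.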

The following paragraphs will describe the rest of the coding
operations performed on a subframe basis. In the following description, the
filter A(z) denotes the unquantized interpolated LP filter in the subframe, and
the filter $\hat{A}(z)$ denotes the quantized interpolated LP filter in the subframe.

Perceptual Weighting: 

In analysis-by-synthesis coders, the optimum pitch and innovation 
parameters are searched by minimizing the mean squared error between 
the input speech and synthesized speech in a perceptually weighted 
domain. This is equivalent to minimizing the error between the weighted 
input speech and weighted synthesis speech. 

The weighted signal $s_w(n)$ is computed in a weighted signal
generator 105. Traditionally, the weighted signal $s_w(n)$ is computed by a
weighting filter having a transfer function W(z) in the form

$$W(z) = \frac{A(z/\gamma_1)}{A(z/\gamma_2)}, \quad \text{where } 0 < \gamma_2 < \gamma_1 < 1$$

In analysis-by-synthesis (AbS) coders, analysis shows that the
quantization error is weighted by a transfer function $W^{-1}(z)$, which is the
inverse of the transfer function of the filter 105. Transfer function $W^{-1}(z)$
exhibits some of the formant structure in the input signal. Thus, the
masking property of the human ear is exploited by shaping the error, so
that it has more energy in the formant regions, where it will be masked by
the strong signal energy present in those regions. The amount of
weighting is controlled by the factors $\gamma_1$ and $\gamma_2$.
The above traditional weighting filter works well with telephone band
signals. However, it was found that this traditional filter is not suitable for
efficient perceptual weighting when applied to wideband signals.
It was found that this filter has inherent limitations in modeling the formant
structure and the required spectral tilt concurrently. The spectral tilt is
more pronounced in wideband signals due to the wide dynamic range
between low and high frequencies. The prior art has suggested adding
a tilt filter into W(z) in order to control the tilt and formant weighting
separately.

A novel solution to this problem, which is part of the present invention,
is to introduce the preemphasis filter 103 at the input, compute the LP filter
A(z) based on the preemphasized speech s(n), and use a modified filter
W(z) by fixing its denominator.

LP analysis is performed in module 104 on the preemphasized
signal s(n) to obtain the LP filter A(z). Also, a new weighting filter
105 with fixed denominator

$$W(z) = \frac{A(z/\gamma_1)}{1 - \gamma_2 z^{-1}}, \quad \text{where } 0 < \gamma_2 < \gamma_1 < 1$$

is used (a higher order can be used at the denominator). This form
decouples the formant weighting from the tilt.





Note that because A(z) is computed based on the preemphasized
speech signal s(n), the tilt of the filter $1/A(z/\gamma_1)$ is less pronounced
compared to the case when A(z) is computed based on the original
speech. Since deemphasis is made at the receiver end using a filter
having a transfer function $P^{-1}(z) = 1/(1 - \mu z^{-1})$, the quantization error
spectrum is shaped by a filter having a transfer function $W^{-1}(z)P^{-1}(z)$.
When $\mu$ is set equal to $\gamma_2$, which is typically the case, the spectrum of
the quantization error is shaped by a filter whose transfer function is
$1/A(z/\gamma_1)$, with A(z) computed based on the preemphasized speech.
Subjective listening showed that this structure of achieving the error
shaping by a combination of preemphasis and modified weighting filtering
is very efficient for encoding wideband signals, in addition to the
advantages of ease of fixed-point algorithmic implementation.
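The modified weighting filter with fixed denominator can be sketched in a few lines. The example below is illustrative only: the function names are hypothetical, the filter starts from zero state, and the value chosen for `gamma1` is an assumption, since the text only requires $0 < \gamma_2 < \gamma_1 < 1$. The default `gamma2=0.7` follows the typical case in which $\gamma_2$ equals the preemphasis factor.

```python
def bandwidth_expand(a, gamma):
    """Coefficients of A(z/gamma): the i-th coefficient a_i is scaled by gamma**i."""
    return [c * gamma ** i for i, c in enumerate(a)]

def weighting_filter(s, a, gamma1=0.9, gamma2=0.7):
    """Filter s through W(z) = A(z/gamma1) / (1 - gamma2*z^-1), zero initial state.

    a = [1, a_1, ..., a_p] are the LP coefficients computed on preemphasized speech.
    gamma1 = 0.9 is an assumed illustrative value, not taken from the text.
    """
    num = bandwidth_expand(a, gamma1)
    out = []
    for n in range(len(s)):
        # FIR part: numerator A(z/gamma1)
        fir = sum(num[i] * s[n - i] for i in range(len(num)) if n - i >= 0)
        # One-pole part: fixed denominator 1 - gamma2*z^-1
        prev = out[n - 1] if n > 0 else 0.0
        out.append(fir + gamma2 * prev)
    return out
```

Because the denominator no longer depends on A(z), the tilt contributed by the pole at $\gamma_2$ is fixed, while the formant weighting is controlled by $\gamma_1$ alone.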




Pitch Analysis: 

In order to simplify the pitch analysis, an open-loop pitch lag is first
estimated in the open-loop pitch search module 106 using the weighted
speech signal $s_w(n)$. Then the closed-loop pitch analysis, which is
performed in closed-loop pitch search module 107 on a subframe basis,
is restricted around the open-loop pitch lag, which significantly reduces the
search complexity of the LTP parameters T and b (pitch lag and pitch
gain). Open-loop pitch analysis is usually performed once every 10 ms
(two subframes) using techniques well known in the art.

The target signal for LTP (Long Term Prediction) analysis, x, is first
computed. This is usually done by subtracting the zero-input response
of a weighted synthesis filter W(z)/A(z) (calculated by a zero-input
response generator 108) from the weighted speech signal $s_w(n)$. More
specifically, the target vector x is calculated using the following relation:

$$x = s_w - s_0$$

where x is the N-dimensional target vector, $s_w$ is the weighted signal
vector in the subframe, and $s_0$ is the zero-input response of the filter
W(z)/A(z), that is, the output of the combined filter W(z)/A(z) due to its initial
states. $s_0$ is computed in the zero-input response generator 108.
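The relation between the zero-input response and the target vector can be illustrated with a generic recursive filter. In this sketch the weighted synthesis filter is represented by hypothetical coefficient lists `num` and `den` (with `den[0] == 1`), and its memory by the past input and output samples; these names and this state representation are assumptions made for the example.

```python
def iir_filter(num, den, x, past_x=(), past_y=()):
    """y(n) = sum_i num[i] x(n-i) - sum_{i>=1} den[i] y(n-i), with den[0] == 1.

    past_x / past_y hold earlier input / output samples, most recent last.
    Samples older than the supplied history are taken as zero.
    """
    xs = list(past_x) + list(x)
    ys = list(past_y)
    off_x, off_y = len(past_x), len(past_y)
    for n in range(len(x)):
        acc = 0.0
        for i, c in enumerate(num):
            if off_x + n - i >= 0:
                acc += c * xs[off_x + n - i]
        for i, c in enumerate(den[1:], start=1):
            if off_y + n - i >= 0:
                acc -= c * ys[off_y + n - i]
        ys.append(acc)
    return ys[off_y:]

def target_vector(s_w, num, den, past_x, past_y):
    """x = s_w - s_0, where s_0 is the filter output for zero input from its saved state."""
    s_0 = iir_filter(num, den, [0.0] * len(s_w), past_x, past_y)
    return [sw - z for sw, z in zip(s_w, s_0)]
```

Removing the zero-input response in this way lets the subsequent searches treat each candidate excitation as if the filter started from zero state.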

Note that alternative, but mathematically equivalent,
approaches can be used to compute the target vector x.

An N-dimensional impulse response vector h of the weighted synthesis
filter W(z)/A(z) is computed in the impulse response generator 109.

The closed-loop pitch (or adaptive codebook) parameters are
computed in the closed-loop pitch search module 107, which uses the target
vector x and the impulse response vector h as inputs. Traditionally, the pitch
prediction was represented by a pitch filter having the following transfer
function:

$$\frac{1}{1 - b z^{-T}}$$

where b is the pitch gain and T is the pitch delay or lag. In this case, the
pitch contribution to the excitation signal u(n) is given by b u(n-T), where the
total excitation is given by

$$u(n) = b\,u(n-T) + g\,c_k(n)$$

with g being the innovative codebook gain and $c_k(n)$ the innovation
codevector at index k.

This representation has limitations if the delay T is shorter than the
subframe length N. From another viewpoint, the pitch contribution can be seen
as an adaptive codebook containing the past excitation signal. Generally,
each vector in the adaptive codebook is a shift-by-one version of the
previous vector (discarding one sample and adding a new sample). For
delays T>N, the adaptive codebook is equivalent to the filter structure, and
a codevector $v_T(n)$ is given by

$$v_T(n) = u(n-T), \quad n = 0, \ldots, N-1.$$

For delays T shorter than N, a codevector is built by repeating the available
samples from the past excitation until the codevector is completed (this is not
equivalent to the filter structure).

In recent coders, a higher pitch resolution is used which significantly
improves the quality of voiced sound segments. This is achieved by
oversampling the past excitation signal using polyphase interpolation filters.
In this case, the codevector $v_T(n)$ may correspond to an interpolated version
of the past excitation, with T being a non-integer delay (e.g. 50.25).
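For integer delays, the construction of an adaptive codebook vector, including the repetition used when the delay is shorter than the subframe, can be sketched as below. The function name and the buffer layout (`past_exc[-1]` holding u(-1)) are assumptions for this illustration; fractional delays would additionally require the interpolation filters mentioned above.

```python
def adaptive_codevector(past_exc, T, N):
    """Build v_T(n) = u(n - T), n = 0..N-1, for an integer delay T.

    past_exc holds the past excitation, most recent sample last
    (past_exc[-1] is u(-1)). When T < N, the samples already copied
    are reused, which repeats the last T samples periodically.
    """
    if T < 1 or T > len(past_exc):
        raise ValueError("delay must satisfy 1 <= T <= len(past_exc)")
    buf = list(past_exc)
    for _ in range(N):
        buf.append(buf[-T])   # sample located T positions back from the new one
    return buf[len(past_exc):]

# Delay shorter than the subframe: the last two samples are repeated.
print(adaptive_codevector([1, 2, 3, 4, 5], T=2, N=5))   # [4, 5, 4, 5, 4]
```

With T at least as large as N the loop only reads genuine past samples, which is the case where the adaptive codebook coincides with the pitch filter structure.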



The pitch search consists of finding the best delay T and gain b that
minimize the mean squared weighted error E between the target vector x and
the scaled filtered past excitation:

$$E = \lVert x - b\,y_T \rVert^2$$

where $y_T$ is the filtered adaptive codevector at delay T:

$$y_T(n) = v_T(n) * h(n) = \sum_{i=0}^{n} v_T(i)\,h(n-i), \quad n = 0, \ldots, N-1.$$

It can be shown that the error E is minimized by maximizing the criterion

$$C = \frac{x^t y_T}{\sqrt{y_T^t y_T}}$$

where $t$ denotes vector transpose.
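A closed-loop search over a set of candidate delays can be sketched as follows, assuming the criterion takes the normalized-correlation form $C = x^t y_T / \sqrt{y_T^t y_T}$ and the optimum gain is $b = x^t y_T / \lVert y_T \rVert^2$. The function names and the dictionary of precomputed candidate codevectors are hypothetical conveniences for this example.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def convolve_truncated(v, h):
    """y(n) = sum_{i=0..n} v(i) h(n-i) for n = 0..N-1 (h must have at least N taps)."""
    return [sum(v[i] * h[n - i] for i in range(n + 1)) for n in range(len(v))]

def pitch_search(x, h, candidates):
    """Pick the delay maximizing C = x.y_T / sqrt(y_T.y_T).

    candidates maps each delay T to its codevector v_T.
    Returns (T, b, C) with b = x.y_T / (y_T.y_T), or None if no candidate is usable.
    """
    best = None
    for T, v in candidates.items():
        y = convolve_truncated(v, h)
        energy = dot(y, y)
        if energy <= 0.0:
            continue                      # skip all-zero filtered codevectors
        corr = dot(x, y)
        C = corr / math.sqrt(energy)
        if best is None or C > best[2]:
            best = (T, corr / energy, C)
    return best
```

Substituting the optimum gain into E gives $E = \lVert x \rVert^2 - C^2$ for candidates with positive correlation, which is why maximizing C minimizes the error.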



In the preferred embodiment of the present invention, a 1/3
subsample pitch resolution is used, and the pitch search is composed of
three stages.

In the first stage, an open-loop delay is estimated in open-loop pitch
search module 106. In the second stage, the search criterion C is searched
in the closed-loop pitch search module 107 for integer delays around the
estimated open-loop delay (usually ±5), which significantly simplifies the
search procedure. A simple procedure is used for updating the filtered
codevector $y_T$ without the need to compute the convolution for every delay.
Once an optimum integer delay is found in the second stage, the fractions
around that optimum integer delay are tested in the third stage of the search.
When the pitch predictor is represented by a filter of the form
$1/(1 - b z^{-T})$, which is a valid assumption for delays T>N, the spectrum of
the pitch filter exhibits a harmonic structure over the entire frequency
range, with a harmonic frequency related to 1/T. In case of wideband
signals, this structure is not very efficient since the harmonic structure in
wideband signals does not cover the entire extended spectrum. The
harmonic structure exists only up to a certain frequency, depending on
the speech segment. Thus, in order to achieve efficient representation
of the pitch contribution in voiced segments of wideband speech, the pitch
predictor needs to have the flexibility of varying the amount of periodicity
over the wideband spectrum.

A new method which achieves efficient modeling of the harmonic
structure of the speech spectrum is disclosed in the present specification,
whereby several forms of low pass filters are applied to the past excitation
and the one with the highest prediction gain is selected.

When subsample pitch resolution is used, the low pass filters can
be incorporated into the interpolation filters used to obtain the higher pitch
resolution. In this case, the third stage of the pitch search, in which the
fractions around the chosen integer delay are tested, is repeated for the
several interpolation filters having different low-pass characteristics and
the fraction and filter index which maximize the search criterion C are
selected.




A simpler approach is to complete the search in the three stages
described above, to determine the optimum fractional delay using only
one interpolation filter with a certain frequency response, and to select the
optimum low-pass filter shape at the end by applying the different
pre-determined low-pass filters to the chosen adaptive codebook vector $v_T$
and selecting the low-pass filter which minimizes the pitch prediction error.

Figure 3 shows a schematic block diagram of a preferred
embodiment of the proposed approach.

In module 303, the past excitation codevector is memorized.
Module 301 is responsive to the target vector x and to the past excitation
codevector from memory module 303 to conduct a pitch codebook search
minimizing the above-defined search criterion C. From the result of the
search conducted in module 301, module 302 generates the optimum
codevector $v_T$.

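Suppose that K filters are used; these can be low-pass or
band-pass filters. Once the optimum codevector $v_T$ is determined, K
filtered versions of $v_T$ are computed using K different frequency shaping
filters $305^{(j)}$, where j=1, ..., K. These filtered versions are denoted
$v_f^{(j)}$, j=1, ..., K. The different vectors are convolved in modules $304^{(j)}$,
where j=1, ..., K, with the impulse response h to obtain the vectors $y^{(j)}$,
j=1, ..., K. The selected frequency shaping filter $305^{(j)}$ is the one which
minimizes the mean squared pitch prediction error

$$e^{(j)} = \lVert x - b^{(j)} y^{(j)} \rVert^2, \quad j = 1, \ldots, K.$$

To calculate the mean squared pitch prediction error $e^{(j)}$ for each vector
$y^{(j)}$, the value $y^{(j)}$ is multiplied by the gain $b^{(j)}$ by means of an
amplifier $307^{(j)}$ and the value $b^{(j)} y^{(j)}$ is subtracted from the target
vector x by means of subtractors $308^{(j)}$.

The gain associated with the frequency shaping filter at index j
is given by

$$b^{(j)} = \frac{x^t y^{(j)}}{\lVert y^{(j)} \rVert^2}.$$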



In the same manner, the optimum codevector $v_T$ is convolved with the impulse
response h to obtain the vector y. To calculate the mean squared pitch
prediction error for y, the value y is multiplied by the gain b by means of an
amplifier 307 and the value by is subtracted from the target vector x by
means of a subtractor 308. The gain b is given by

$$b = \frac{x^t y}{\lVert y \rVert^2}$$



In module 309, the parameters b, T, and j are chosen based on $v_T$ or
$v_f^{(j)}$, whichever minimizes the mean squared pitch prediction error e.
The pitch codebook index T is encoded and transmitted. The pitch
gain b is quantized and transmitted. With this new approach, extra
information is needed to encode the index j of the selected frequency
shaping filter. If two filters are used, then one bit is needed to represent this
information.



Innovative codebook search:

Once the pitch, or LTP (Long Term Prediction), parameters b, T, and
j are determined, we proceed by searching for the optimum innovative
excitation by means of module 110 of Figure 1. First, the target vector x is
updated by subtracting the LTP contribution:

$$x' = x - b\, y_T$$

where b is the pitch gain and $y_T$ is the filtered adaptive codebook vector
(the past excitation at delay T filtered with the selected low-pass filter and
convolved with the impulse response h as described with reference to
Figure 3).



The search procedure in CELP is performed by finding the optimum
excitation codevector $c_k$ and gain g which minimize the mean squared error
between the target vector and the scaled filtered codevector:

$$E = \| x' - g H c_k \|^2$$

where H is a lower triangular convolution matrix derived from the impulse 
response vector h. 
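The target update and the error criterion can be illustrated with a brute-force search over a toy codebook. This is only a sketch under stated assumptions: an algebraic codebook is searched with far more efficient methods than the exhaustive loop shown here, the per-codevector gain is assumed to be the least-squares optimum $g = x'^t H c_k / \| H c_k \|^2$, and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def filter_codevector(c: Sequence[float], h: Sequence[float]) -> List[float]:
    """Compute H c, where H is the lower triangular convolution matrix of h."""
    return [
        sum(h[i - k] * c[k] for k in range(i + 1) if i - k < len(h))
        for i in range(len(c))
    ]


def search_innovation(
    x: Sequence[float],
    y_t: Sequence[float],
    b: float,
    h: Sequence[float],
    codebook: Sequence[Sequence[float]],
) -> Tuple[int, float, float]:
    """Return (k, g, E) minimizing E = ||x' - g H c_k||^2 over the codebook."""
    # Updated target: x' = x - b * y_T (LTP contribution removed).
    x_new = [xi - b * yi for xi, yi in zip(x, y_t)]
    best = None
    for k, c in enumerate(codebook):
        z = filter_codevector(c, h)
        energy = sum(v * v for v in z)
        if energy == 0.0:
            continue  # a silent codevector cannot reduce the error
        # Assumed least-squares gain for this codevector.
        g = sum(a * v for a, v in zip(x_new, z)) / energy
        err = sum((a - g * v) ** 2 for a, v in zip(x_new, z))
        if best is None or err < best[2]:
            best = (k, g, err)
    if best is None:
        raise ValueError("codebook contains no usable codevector")
    return best
```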

In the preferred embodiment of the present invention, the innovative
codebook search is performed in module 110 by means of an algebraic
codebook as described in US patent numbers 5,444,816 (Adoul et al.)
issued on August 22, 1995; 5,699,482 granted to Adoul et al. on December
17, 1997; 5,754,976 granted to Adoul et al. on May 19, 1998; and 5,701,392
(Adoul et al.) dated December 23, 1997.

Once the optimum codevector and its gain are chosen by module
110, the codebook index k and gain g are encoded and transmitted.

Referring to Figure 1, the parameters b, T, j, A(z), k and g are
multiplexed through a multiplexer 112 before being encoded and transmitted.

Memory update:

In module 111 (Figure 1), the states of the weighted synthesis
filter are updated by filtering the excitation signal $u = g c_k + b v_T$
through the weighted synthesis filter. At the end of this filtering, the
states of the filter are memorized and used in the next subframe as initial
states for computing the zero-input response in generator module 108.

As in the case of the target vector, other alternative, but mathematically
equivalent, approaches can be used to update the filter states.


DECODING PRINCIPLE 

The speech decoding device of Figure 2 illustrates the various steps 
carried out between the digital input 222 (input to the demultiplexer 217) and
the output sampled speech 223 (output of the adder 221).

The demultiplexer 217 extracts the synthesis model parameters from
the binary information received from a digital input channel. From each
received binary frame, the extracted parameters are:

- the short-term prediction parameters STP (once per frame);

- the long-term prediction (LTP) parameters T, b, and j (for each
subframe); and

- the innovation codebook index k and gain g (for each subframe). 

The current speech signal is synthesized based on these parameters as will
be explained hereinbelow.

The innovative excitation generator 218 is responsive to the index k
to produce the innovation codevector $c_k$, which is scaled by the decoded
gain factor g through an amplifier 224. In the preferred embodiment, an
algebraic codebook as described in the above-mentioned US patent
numbers 5,444,816; 5,699,482; 5,754,976; and 5,701,392 is used to
represent the innovative excitation.

The generated scaled codevector at the output of the amplifier 224 
is processed through a frequency-dependent pitch enhancer 205. 

Enhancing the periodicity of the excitation signal improves the
quality in case of voiced segments. This was done in the past by filtering
the innovation vector from the innovative codebook (fixed codebook)
through a filter of the form $1/(1 - \varepsilon b z^{-T})$, where
$\varepsilon$ is a factor below 0.5 which controls the amount of
introduced periodicity. This approach is less efficient in case of wideband
signals since it introduces the periodicity over the entire spectrum. A new 
alternative approach, which is part of the present invention, is disclosed 
whereby the periodicity enhancement is achieved by filtering the 
innovative signal from the fixed codebook by a filter F(z) whose frequency 
response emphasizes the higher frequencies more than lower 
frequencies. The coefficients of F(z) are related to the amount of 
periodicity in the signal. An efficient way to derive the filter coefficients is 
to relate them to the amount of pitch contribution to the total excitation. 
This results in a frequency response depending on the subframe 
periodicity, where higher frequencies are more strongly emphasized 
(stronger overall slope) for higher pitch gains. This filter has the effect of 
lowering the energy of the innovative excitation at low frequencies when 
the signal is more periodic, which enhances the periodicity of the 
excitation at lower frequencies more than higher frequencies. Suggested 
forms of this filter are

(1) $F(z) = 1 - \sigma z^{-1}$ or (2) $F(z) = -\alpha z + 1 - \alpha z^{-1}$

where $\sigma$ or $\alpha$ are factors derived from the level of periodicity
of the signal. The second, three-tap form of F(z) is used in this preferred
embodiment. The factor $\alpha$ is computed in the voicing factor generator
204 as follows: The ratio of pitch contribution to the total excitation is
first computed by




$$R_p = \frac{b^2 v_T^t v_T}{u^t u} = \frac{b^2 \sum_{n=0}^{N-1} v_T^2(n)}{\sum_{n=0}^{N-1} u^2(n)}$$

where $v_T$ is the pitch codebook vector, b is the pitch gain, and u is the
excitation vector given at the output of the adder 219 by

$$u = g c_k + b v_T$$

Note that the term $b v_T$ is produced by the pitch
codebook 201 in response to the pitch lag T and the past value of u
stored in memory 203. The adaptive codevector from the pitch codebook
201 is then processed through a low-pass filter whose cut-off frequency
is adjusted by means of the index j from the demultiplexer 217. The
resulting codevector $v_T$ is then multiplied by the gain b from the
demultiplexer 217 through an amplifier 226 to obtain the signal $b v_T$.

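The factor $\alpha$ is given by

$$\alpha = q R_p, \quad \text{bounded by } \alpha < q$$

where $R_p$ denotes the above-defined ratio of pitch contribution to the
total excitation and q is a factor which controls the amount of
enhancement (q is set to 0.25 in this preferred embodiment).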

The enhanced signal $c_f$ is computed by filtering the scaled
innovative vector $g c_k$ through the enhancing filter F(z).
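The enhancement step can be sketched as follows, using the three-tap form $F(z) = -\alpha z + 1 - \alpha z^{-1}$. The sketch assumes that $\alpha$ is obtained as q times the ratio of pitch contribution to total excitation, clipped at q, and that samples outside the current subframe are treated as zero; the function names are hypothetical.

```python
from typing import List, Sequence


def pitch_ratio(b: float, v_t: Sequence[float], u: Sequence[float]) -> float:
    """Ratio of pitch contribution energy to total excitation energy."""
    total = sum(s * s for s in u)
    if total == 0.0:
        return 0.0
    return b * b * sum(v * v for v in v_t) / total


def enhancement_factor(ratio: float, q: float = 0.25) -> float:
    """alpha = q * ratio, kept from exceeding q (assumed form of the bound)."""
    return min(q * ratio, q)


def enhance_innovation(scaled_c: Sequence[float], alpha: float) -> List[float]:
    """Apply F(z) = -alpha*z + 1 - alpha*z^-1 to the scaled innovation.

    Output sample n is c(n) - alpha * (c(n-1) + c(n+1)); neighbours that
    fall outside the subframe are taken as zero in this sketch.
    """
    n = len(scaled_c)
    out = []
    for i in range(n):
        prev_sample = scaled_c[i - 1] if i > 0 else 0.0
        next_sample = scaled_c[i + 1] if i + 1 < n else 0.0
        out.append(scaled_c[i] - alpha * (prev_sample + next_sample))
    return out
```

With a larger pitch gain the ratio grows, $\alpha$ approaches q, and the low-frequency content of the innovation is attenuated more strongly.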

The enhanced excitation signal u' is computed by the adder 220
as

$$u' = c_f + b v_T$$

Note that this process is not performed at the encoder. Thus, it is
essential to update the content of the adaptive codebook using the
excitation without enhancement to keep synchronism between the
encoder and decoder. Therefore, the excitation signal u is used to update
the memory of the adaptive codebook and the enhanced excitation signal u'
is used at the input of the LP synthesis filter 206.

The synthesized signal s' is computed by filtering the enhanced 
excitation signal u' through the LP synthesis filter 206 which has the form 
1/A(z), where A(z) is the interpolated LP filter in the current subframe. As
can be seen in Figure 2, the LP coefficients 225 from the demultiplexer 217
are supplied to the LP filter 206 to adjust the parameters of the LP filter 206
accordingly. The deemphasis filter 207 is the inverse of the preemphasis
filter 103 of Figure 1. The transfer function of the deemphasis filter 207 is
given by

$$D(z) = \frac{1}{1 - \mu z^{-1}}$$

where $\mu$ is the preemphasis factor.




The vector s' is filtered through the deemphasis filter D(z) (module
207) to obtain the vector $s_d$, which is passed through the postprocessing
module 208 comprising a high-pass filter to remove the unwanted
frequencies below 50 Hz and obtain the signal $s_h$.
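The deemphasis step can be sketched as a one-pole recursion. This is a minimal illustration assuming a first-order preemphasis of the form $1 - \mu z^{-1}$, so that its inverse is $D(z) = 1/(1 - \mu z^{-1})$; the function names and the value of $\mu$ used in any call are hypothetical.

```python
from typing import List, Sequence, Tuple


def preemphasis(s: Sequence[float], mu: float) -> List[float]:
    """Apply 1 - mu*z^-1 with zero initial state (for comparison only)."""
    out = []
    prev = 0.0
    for sample in s:
        out.append(sample - mu * prev)
        prev = sample
    return out


def deemphasis(
    s: Sequence[float], mu: float, state: float = 0.0
) -> Tuple[List[float], float]:
    """Apply D(z) = 1 / (1 - mu*z^-1): s_d(n) = s(n) + mu * s_d(n-1).

    Returns the filtered samples and the final filter state, so that the
    state can be carried over to the next subframe.
    """
    out = []
    prev = state
    for sample in s:
        prev = sample + mu * prev
        out.append(prev)
    return out, prev
```

Feeding the output of `preemphasis` into `deemphasis` with the same factor returns the original samples, which is the sense in which the two filters are inverses.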

The over-sampling module 209 conducts the inverse process of the
down-sampling module 101 of Figure 1. In this preferred embodiment,
oversampling converts from the 12.8 kHz sampling rate to the original 16 kHz
sampling rate, using techniques well known in the art. The oversampled
synthesis signal is denoted $\hat{s}$.

The synthesis signal does not contain the higher frequency
components which were lost by the down-sampling process (module 101 of
Figure 1) at the encoder. This gives a low-pass perception to the synthesized
speech. To restore the full band of the original signal, a high frequency
generation procedure is disclosed. This procedure is performed in modules
212 through 216 of Figure 2.

In this new approach, the high frequency contents are generated by 
filling the upper part of the spectrum with a white noise properly scaled in the 
excitation domain, then converted to the speech domain, preferably by 
shaping it with the same LP filter used for synthesizing the down-sampled
signal. 

The high frequency generation procedure, in accordance with the present
invention, is detailed hereinbelow:


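The random noise generator 213 generates a white noise sequence
w' with a flat spectrum over the entire frequency bandwidth, using
techniques well known in the art. The generated sequence is of length N',
which is the subframe length in the original domain. Note that N is the
subframe length in the down-sampled domain. In this preferred
embodiment, N=64 and N'=80 which correspond to 5 ms.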
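As a check on these subframe lengths, assuming the 12.8 kHz down-sampled rate and the 16 kHz original rate of this preferred embodiment, both lengths span the same duration:

$$\frac{N}{12800\ \text{Hz}} = \frac{64}{12800}\ \text{s} = 5\ \text{ms}, \qquad \frac{N'}{16000\ \text{Hz}} = \frac{80}{16000}\ \text{s} = 5\ \text{ms}$$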

The white noise sequence is properly scaled in the gain adjusting
module 214. Gain adjustment comprises the following steps. First, the
energy of the generated noise sequence is set equal to the energy of the
enhanced excitation signal u' computed by an energy computing module
210, and the resulting scaled noise sequence is given by

$$w(n) = w'(n) \sqrt{\frac{\sum_{n=0}^{N-1} u'^2(n)}{\sum_{n=0}^{N'-1} w'^2(n)}}, \quad n = 0, \ldots, N'-1$$
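The first gain adjustment step, which matches the noise energy to the energy of the enhanced excitation, can be sketched as follows. This is a minimal illustration with plain Python lists; the function name is hypothetical, and the two sequences may have different lengths (N for the excitation, N' for the noise).

```python
import math
from typing import List, Sequence


def scale_noise_to_excitation(
    noise: Sequence[float], excitation: Sequence[float]
) -> List[float]:
    """Scale the noise so that its energy equals the excitation energy."""
    noise_energy = sum(w * w for w in noise)
    if noise_energy == 0.0:
        return [0.0 for _ in noise]  # nothing to scale
    excitation_energy = sum(u * u for u in excitation)
    gain = math.sqrt(excitation_energy / noise_energy)
    return [gain * w for w in noise]
```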



The second step in the gain scaling is to take into account the voicing 
of the synthesized signal at the output of generator 204 so as to reduce the 
energy of the generated noise proportional to the voicing. In this preferred 
embodiment, this is implemented by measuring the tilt of the synthesis signal 
through a spectral tilt calculator 212 and reducing the energy accordingly. 
When the tilt is very strong, which corresponds to voiced segments, the 
noise energy is further reduced. The tilt factor is computed in module 212
as the first correlation coefficient of the synthesis signal $s_h$ and it is
given by

$$\text{tilt} = \frac{\sum_{n=1}^{N-1} s_h(n)\, s_h(n-1)}{\sum_{n=0}^{N-1} s_h^2(n)}, \quad \text{bounded by tilt} \geq 0 \text{ and tilt} \geq r_v$$



$r_v$ is given by

$$r_v = \frac{E_v - E_c}{E_v + E_c}$$

where $E_v$ is the energy of the scaled pitch codevector and $E_c$ is the
energy of the scaled innovative codevector. $r_v$ is mostly less than tilt,
but this bound was introduced as a precaution against high frequency tones
where the tilt value is negative and the value of $r_v$ is high.
So this bound reduces the noise energy for such tonal signals.
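The bounded tilt can be sketched as follows. This is an illustration only, with hypothetical function names, assuming the tilt is the normalized first autocorrelation coefficient of the synthesis signal and that both lower bounds (zero and $r_v$) are applied with a simple maximum.

```python
from typing import Sequence


def voicing_factor(pitch_energy: float, innovation_energy: float) -> float:
    """r_v = (E_v - E_c) / (E_v + E_c), taken as 0 for a silent subframe."""
    total = pitch_energy + innovation_energy
    return (pitch_energy - innovation_energy) / total if total > 0.0 else 0.0


def bounded_tilt(s_h: Sequence[float], r_v: float) -> float:
    """First correlation coefficient of s_h, bounded below by 0 and r_v."""
    energy = sum(s * s for s in s_h)
    if energy == 0.0:
        raw = 0.0
    else:
        raw = sum(s_h[n] * s_h[n - 1] for n in range(1, len(s_h))) / energy
    return max(raw, 0.0, r_v)
```

For an alternating test signal such as `[1, -1, 1, -1]`, the raw coefficient is negative, so the returned value is governed by the lower bounds rather than by the correlation itself.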






The tilt value is 0 in case of flat spectrum and 1 in case of strongly
voiced signals. The scaling factor derived from the tilt is given by

$$g_t = 10^{-0.6\,\text{tilt}}$$

When the tilt is close to zero, the scaling factor is close to 1, which
does not result in energy reduction. When the tilt value is 1, the scaling
factor results in a reduction of 12 dB in the energy of the generated noise.
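The relation between the tilt and the resulting level change can be checked numerically. The sketch below assumes the scaling factor has the form $10^{-0.6\,\text{tilt}}$ and is applied to the noise amplitude, which is the reading consistent with the 12 dB figure quoted above; the function names are hypothetical.

```python
import math


def tilt_scaling_factor(tilt: float) -> float:
    """Amplitude scaling factor assumed to be 10 ** (-0.6 * tilt)."""
    return 10.0 ** (-0.6 * tilt)


def level_change_db(amplitude_gain: float) -> float:
    """Level change in dB produced by an amplitude gain."""
    return 20.0 * math.log10(amplitude_gain)
```

Evaluating `level_change_db(tilt_scaling_factor(1.0))` gives approximately -12 dB, and a tilt of 0 leaves the noise level unchanged.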

Once the noise is properly scaled, it is brought into the speech 
domain using the spectral shaper 215. In the preferred embodiment, this is 
achieved by filtering the noise through a bandwidth expanded version of the 
same LP synthesis filter used in the down-sampled domain (1/A(z/0.8)). 
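The spectral shaping step can be sketched as follows. This is a minimal illustration, assuming the convention $A(z) = 1 + \sum_{i=1}^{p} a_i z^{-i}$ so that $A(z/\gamma)$ has coefficients $a_i \gamma^i$; the function names are hypothetical and the filter starts from zero state.

```python
from typing import List, Sequence


def bandwidth_expand(a: Sequence[float], gamma: float = 0.8) -> List[float]:
    """Coefficients of A(z/gamma) given a = [1, a_1, ..., a_p]."""
    return [coef * gamma ** i for i, coef in enumerate(a)]


def all_pole_filter(x: Sequence[float], a: Sequence[float]) -> List[float]:
    """Filter x through 1/A(z): y(n) = x(n) - sum_i a_i * y(n - i)."""
    order = len(a) - 1
    history = [0.0] * order  # history[0] holds y(n-1), history[1] y(n-2), ...
    out = []
    for sample in x:
        y = sample - sum(a[i] * history[i - 1] for i in range(1, order + 1))
        out.append(y)
        if order:
            history = [y] + history[:-1]
    return out


def shape_noise(noise: Sequence[float], a: Sequence[float]) -> List[float]:
    """Bring scaled noise into the speech domain through 1/A(z/0.8)."""
    return all_pole_filter(noise, bandwidth_expand(a, 0.8))
```

Bandwidth expansion moves the poles of the synthesis filter toward the origin, which flattens the formant peaks applied to the noise.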

The filtered scaled noise sequence is then band-pass filtered to the
required frequency range to be restored using the band-pass filter 216. In
the preferred embodiment, the band-pass filter 216 restricts the noise
sequence to the frequency range 5.6-7.2 kHz. The resulting band-pass
filtered noise sequence z is added to the oversampled synthesized speech
signal $\hat{s}$ to obtain the final reconstructed speech signal.

Although the present invention has been described hereinabove by 
way of a preferred embodiment thereof, this embodiment can be modified 
at will, within the scope of the appended claims, without departing from the
spirit and nature of the subject invention. 